
Add stagingModeAutoFlushThreshold for staging mode batch control #26577

Merged
anthony-murphy merged 30 commits into microsoft:main from anthony-murphy-agent:staging-batch-control on Mar 25, 2026

Conversation

@anthony-murphy-agent anthony-murphy-agent commented Feb 27, 2026

Summary

  • Add stagingModeAutoFlushThreshold option to ContainerRuntimeOptions that controls automatic batch flushing during staging mode
  • When in staging mode, suppress turn-based/async flush scheduling until the accumulated batch reaches the threshold op count
  • Default threshold: 1000 ops (tied to largeBatchThreshold constant) — incoming ops always break the batch regardless
  • Wrap exitStagingMode in a PerformanceEvent reporting duration, exit method, batch count, and batches at or over threshold
  • Remove the DisableFlushBeforeProcess kill-bit flag (split out to Remove DisableFlushBeforeProcess feature flag #26770)

Default Justification (from production telemetry)

  • Copy-paste operations routinely produce batches of 1000+ ops (435K instances over 30 days via GroupLargeBatch telemetry)
  • All observed large batches are non-reentrant single-turn batches from normal user actions (not reconnection replay — reconnect preserves batch boundaries)
  • Receivers on modern Fluid versions (2.74+) handle 1000-op batches without jank (p99 processing duration ~5ms)
  • 1000 matches the existing "large batch" telemetry threshold in OpGroupingManager
  • The threshold only affects cross-turn accumulation; single-turn operations (like paste) are unaffected

Key Design Points

  • Only affects scheduleFlush() — direct flush() calls (incoming ops, connection changes, stashing, exit staging mode) bypass the threshold entirely
  • No effect outside staging mode
  • Exposed on the public ContainerRuntimeOptions interface (@legacy @beta) with forwardCompat: false type validation break acknowledged — consumers using Partial<ContainerRuntimeOptions> (the typical pattern via IContainerRuntimeOptions) are unaffected
  • Config override (Fluid.ContainerRuntime.StagingModeAutoFlushThreshold) > runtime option > default (1000)
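The precedence chain in the last bullet can be sketched as a small resolver. This is an illustrative sketch, not the actual Fluid Framework internals; the function and parameter names are assumptions, but the rule (config override > runtime option > default of 1000) is as stated above.

```typescript
// Hypothetical sketch of the precedence rule described above.
const DEFAULT_STAGING_MODE_AUTO_FLUSH_THRESHOLD = 1000;

function resolveAutoFlushThreshold(
	// Value of the Fluid.ContainerRuntime.StagingModeAutoFlushThreshold config, if set
	configOverride: number | undefined,
	// Value of ContainerRuntimeOptions.stagingModeAutoFlushThreshold, if set
	runtimeOption: number | undefined,
): number {
	// Config override wins over runtime option, which wins over the default.
	return configOverride ?? runtimeOption ?? DEFAULT_STAGING_MODE_AUTO_FLUSH_THRESHOLD;
}
```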

Telemetry

  • ExitStagingMode perf event: duration, exitMethod (commit/discard), autoFlushThreshold, batches, batchesAtOrOverThreshold
  • GroupLargeBatch threshold changed from >= to > so staging-mode auto-flush batches (exactly at threshold) don't trigger the event
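The second bullet amounts to a one-character comparison change. The sketch below is illustrative (the function name is an assumption; `largeBatchThreshold` is the shared constant this PR extracts), showing why a batch auto-flushed at exactly the threshold no longer fires the event.

```typescript
// Sketch of the GroupLargeBatch guard change: ">" instead of ">=".
const largeBatchThreshold = 1000; // shared constant, per this PR

function shouldEmitGroupLargeBatch(opCount: number): boolean {
	// Previously `opCount >= largeBatchThreshold`, which would have fired
	// for every staging-mode auto-flush batch sitting exactly at the threshold.
	return opCount > largeBatchThreshold;
}
```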

Test plan

  • Ops accumulate under threshold during staging mode
  • Ops flush when threshold is reached (with telemetry assertion)
  • Incoming ops break batch regardless of threshold
  • Incoming non-runtime ops break batch during staging mode
  • Reconnect breaks batch during staging mode
  • Exit staging mode (commit) flushes remaining ops
  • Exit staging mode (discard) flushes outbox before rollback
  • enterStagingMode flushes pending outbox as non-staged
  • No effect outside staging mode
  • Default threshold suppresses turn-based flushing during staging mode
  • Config override > runtime option > default precedence
  • Runtime option > default precedence
  • IdAllocation + reconnect while in staging mode (highest risk — b4e1fd1 interaction)
  • Reconnect resubmits pre-staged batches with threshold active
  • All 944 existing tests pass
  • Manual testing with Word integration

🤖 Generated with Claude Code

… mode

During staging mode, the runtime flushes ops into separate staged batches at
every JS turn boundary. This means consumers like Word that want to accumulate
ops across many turns into fewer, larger batches get fragmented results.

Add a `stagingModeMaxBatchOps` option to `ContainerRuntimeOptionsInternal` that
suppresses automatic (turn-based/async) flush scheduling during staging mode
until the accumulated batch reaches the specified op count. Incoming ops still
break the current batch regardless (they change the reference sequence number
via direct flush() calls that bypass scheduleFlush()).

Default: 1000 ops. This was chosen based on production telemetry analysis:
- Copy-paste operations routinely produce batches of 1000+ ops (435K instances
  of >=1000 ops observed over 30 days via GroupLargeBatch telemetry)
- All are non-reentrant single-turn batches from normal user actions
- Receivers on modern Fluid versions (2.74+) handle these without jank
  (p99 processing duration ~5ms for typical batches)
- 1000 matches the existing "large batch" telemetry threshold in OpGroupingManager
- The threshold only affects cross-turn accumulation; single-turn operations
  (like paste) are unaffected since all ops are submitted synchronously

Consumers can override: set to Infinity to only break batches on system events,
or to a lower value for tighter batch control.
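The suppression logic described above can be sketched as a predicate. Names here are illustrative, not the real `ContainerRuntime` internals; the behavior (no effect outside staging mode, suppress until the batch reaches the threshold, `Infinity` disables auto-flush entirely) follows the description above.

```typescript
// Hedged sketch of the scheduleFlush() gating described in this commit message.
interface OutboxStateSketch {
	inStagingMode: boolean;
	batchMessageCount: number;
}

function shouldScheduleAutoFlush(state: OutboxStateSketch, threshold: number): boolean {
	// Outside staging mode, turn-based flush scheduling is unaffected.
	if (!state.inStagingMode) {
		return true;
	}
	// In staging mode, suppress scheduling until the accumulated batch
	// reaches or exceeds the threshold. Direct flush() calls (incoming ops,
	// reconnect, exit staging mode) bypass this check entirely.
	return state.batchMessageCount >= threshold;
}
```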

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-murphy-agent and others added 2 commits February 26, 2026 16:24
The option controls when automatic flush scheduling kicks in, not a cap
on batch size. A batch can contain far more ops if a single synchronous
turn pushes many ops past the threshold (e.g. paste). The new name makes
it clear that only automatic/scheduled flushes are affected, not direct
flush calls from incoming ops, connection changes, or exit staging mode.

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address PR feedback:
- Field is now always `number` (not `number | undefined`), with the
  default applied at construction time
- Add config override via Fluid.ContainerRuntime.StagingModeAutoFlushThreshold
  for runtime tuning without code changes
- Config override takes precedence over runtime option, which takes
  precedence over the default (1000)

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment


Pull request overview

Adds an internal runtime option to control when automatic flush scheduling resumes during staging mode, allowing cross-turn accumulation of staged ops up to a configurable op-count threshold.

Changes:

  • Introduces stagingModeMaxBatchOps?: number on ContainerRuntimeOptionsInternal with a default of 1000 ops.
  • Updates ContainerRuntime.scheduleFlush() to suppress turn-based/async flush scheduling in staging mode until the threshold is reached.
  • Adds staging-mode threshold tests and excludes the option from doc-schema-affecting runtime options.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
packages/runtime/container-runtime/src/containerRuntime.ts Adds option + default constant and gates scheduleFlush() during staging mode based on accumulated op count.
packages/runtime/container-runtime/src/test/containerRuntime.spec.ts Adds tests intended to validate staging-mode batching behavior under/at threshold and with incoming ops.
packages/runtime/container-runtime/src/containerCompatibility.ts Omits stagingModeMaxBatchOps from doc-schema affecting runtime options.

- Fix comment accuracy: scheduleFlush threshold triggers at "reaches or
  exceeds", not just "exceeds"
- Fix maybeFlushPartialBatch comment: by default it throws on unexpected
  sequence number changes, only forces a flush when partial-batch flushing
  is enabled via Fluid.ContainerRuntime.DisableFlushBeforeProcess
- Strengthen threshold and incoming-op tests: assert that the outbox is
  actually emptied (mainBatchMessageCount drops to 0) rather than only
  checking that nothing was submitted to the wire

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

@anthony-murphy anthony-murphy changed the title Add stagingModeMaxBatchOps for staging mode batch control Add stagingModeAutoFlushThreshold for staging mode batch control Mar 15, 2026
@markfields
Member

I left some individual comments about logging in there, but thinking about it holistically, I think we want:

  • Log when this threshold is hit
  • Log when a batch is large in bytes (regardless of the main change here)
  • Be harmonious with existing "GroupLargeBatch" event (PS - Think about hosts that don't enable Grouped Batching)
  • Do we also want telemetry about large blobAttach batches? (regardless of the main change here)

@markfields
Member

@anthony-murphy Claude and I chatted about test cases (and I mentioned the ones you included in the PR description). Here's the test plan that came out of it, worth browsing, there are some important corner cases here:

stagingModeAutoFlushThreshold — Test Plan

Already covered

  • Ops accumulate under threshold
  • Ops flush when threshold reached
  • Incoming ops break batch regardless of threshold
  • Exit via commitChanges flushes remaining ops
  • No effect outside staging mode
  • Default threshold (1000) suppresses turn-based flushing

To add

1. Direct-flush codepaths break the batch during staging mode

These tests demonstrate that every codepath calling flush() directly still breaks
the accumulated batch, regardless of threshold. Each should verify: outbox is drained,
ops move to PSM as a staged batch, and no ops are sent to the wire.

1a. Incoming runtime op breaks batch

Already covered by existing test ("incoming ops break the batch regardless of
threshold"). Listed here for completeness.

1b. Incoming non-runtime op breaks batch

Same as 1a but with a non-runtime (signal/system) op arriving. The process() path
calls flush() unconditionally (line 3066) unless skipSafetyFlushDuringProcessStack
is set. Verify a non-runtime inbound message also drains the outbox mid-accumulation.

1c. Connection state change (reconnect) breaks batch

  1. Enter staging mode, submit ops (under threshold, sitting in outbox)
  2. Simulate disconnect then reconnect (canSendOps transitions true → false → true)
  3. The reconnect flush (line 2962) should drain the outbox before replayPendingStates
  4. Verify the accumulated ops became a staged batch in PSM
  5. Verify pre-staged ops are resubmitted correctly after reconnect

1d. enterStagingMode flushes any pending outbox contents

This covers the edge case where ops were submitted in the same JS turn before
enterStagingMode() is called (so a flush was scheduled but hasn't fired yet).

  1. Submit ops (not yet flushed — still in outbox)
  2. Call enterStagingMode() in the same turn
  3. Verify those ops were flushed as a non-staged batch (they predate staging mode)
  4. Submit more ops while in staging mode
  5. commitChanges() — verify only the post-entry ops are staged

2. Exit via discardChanges flushes outbox before rollback

Same shape as the existing commitChanges exit test but using discardChanges.
Verify outbox is drained and rolled-back ops match what was submitted.

3. IdAllocation + reconnect while in staging mode (b4e1fd1 interaction)

This is the highest-risk gap.

The fix in b4e1fd1 added scheduleFlush() after submitIdAllocationOpIfNeeded
during replayPendingStates to ensure the IdAllocation op is flushed before new ops
with different refSeqs arrive. With threshold suppression, that scheduleFlush() will
now return early if in staging mode and under threshold — potentially re-introducing
the original bug.

Test scenario:

  1. Enter staging mode
  2. Disconnect, generate a compressed ID (queued in idAllocationBatch)
  3. Reconnect — replayPendingStates submits IdAllocation op + calls scheduleFlush()
  4. Simulate remote op arriving (bumps refSeq)
  5. Generate 2nd compressed ID + submit a data store op
  6. Verify no outboxSequenceNumberCoherencyCheck error

If this test fails, the fix is to exempt the scheduleFlush() call in
replayPendingStates from threshold suppression (e.g., pass a force flag, or
call flush() directly instead of scheduleFlush()).
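The suggested remedy (a force flag) can be sketched as follows. This is hypothetical: the `force` parameter does not necessarily exist in the codebase; it simply illustrates how replayPendingStates could be exempted from threshold suppression if the test above fails.

```typescript
// Illustrative sketch of the proposed fix: a hypothetical `force` flag that
// lets replayPendingStates bypass staging-mode threshold suppression.
function scheduleFlushSketch(
	inStagingMode: boolean,
	batchMessageCount: number,
	threshold: number,
	force = false, // replayPendingStates would pass force = true
): boolean {
	if (force) {
		// IdAllocation ops must flush before ops with different refSeqs arrive.
		return true;
	}
	return !inStagingMode || batchMessageCount >= threshold;
}
```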

4. Reconnect resubmits pre-staged batches while threshold is active

Verify that ops submitted before entering staging mode are correctly resubmitted
on reconnect, and that the threshold does not interfere with their resubmission
(since resubmitted pre-staging batches go through replayPendingStates, not
scheduleFlush()).

5. Config override > runtime option > default

Single test: create runtime with both a config override and a runtime option set to
different values. Verify the config override wins. Then verify runtime option wins
over default when no config override is present.

The kill-bit switch for the flush-before-process simplification has been
in production long enough to confirm correctness. Remove the flag,
hardcode the default behavior (flush before process), and clean up the
partial-batch flushing code path that was only reachable when the flag
was enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-murphy and others added 10 commits March 18, 2026 13:56
Extract a shared largeBatchThreshold constant from OpGroupingManager and
use it for the staging-mode auto-flush default, so both values stay in
sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move stagingModeAutoFlushThreshold from ContainerRuntimeOptionsInternal
to the public ContainerRuntimeOptions interface so consumers can
configure it. Make it required (with a default of 1000) to match the
fully-required convention of the options interfaces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new unit tests for stagingModeAutoFlushThreshold:
- discardChanges flushes outbox before rollback
- enterStagingMode flushes pending outbox as non-staged
- config override > runtime option > default precedence (2 tests)
- incoming non-runtime op breaks batch during staging mode
- reconnect breaks batch during staging mode

Also fix ContainerLoadStats telemetry expectations to include
stagingModeAutoFlushThreshold, update type validation for the new
public API surface, and remove unused typeFromBatchedOp helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test 3 (highest risk): Verify that IdAllocation ops submitted during
replayPendingStates in staging mode are properly flushed by the "op"
handler before new ops with different refSeqs arrive, preventing the
outboxSequenceNumberCoherencyCheck error.

Test 4: Verify that pre-staged batches are correctly resubmitted on
reconnect while the threshold is active, and that staged changes can
still be committed afterward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit a telemetry event when the staging mode auto-flush threshold is
reached, including the threshold value and current batch message count.
This helps operators distinguish threshold-triggered flushes from
other flush causes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use PerformanceEvent.timedExec to measure exitStagingMode duration and
report autoFlushCount, autoFlushThreshold, and exitMethod. The perf
event is passed to the discardOrCommit callback so callers can add
properties in the future.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use PerformanceEvent.timedExec to measure exitStagingMode, reporting:
- exitMethod (commit/discard)
- autoFlushCount and autoFlushThreshold
- batches count and batchesOverThreshold (via reportProgress)

Both commit and discard paths return batchInfo arrays (deduplicated by
CSN) so exitStagingMode can compute batch stats uniformly. Also make
replayPendingStates return the replayed batchInfo array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The flag was removed in this PR, so the test passing it is now
redundant — it behaves identically to the remaining test which covers
flush-before-process behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anthony-murphy anthony-murphy marked this pull request as ready for review March 19, 2026 00:31
@microsoft microsoft deleted a comment from azure-pipelines bot Mar 25, 2026
@github-actions

🔗 No broken links found! ✅

Your attention to detail is admirable.

linkcheck output


> fluid-framework-docs-site@0.0.0 ci:check-links /home/runner/work/FluidFramework/FluidFramework/docs
> start-server-and-test "npm run serve -- --no-open" 3000 check-links

1: starting server using command "npm run serve -- --no-open"
and when url "[ 'http://127.0.0.1:3000' ]" is responding with HTTP status code 200
running tests using command "npm run check-links"


> fluid-framework-docs-site@0.0.0 serve
> docusaurus serve --no-open

[SUCCESS] Serving "build" directory at: http://localhost:3000/

> fluid-framework-docs-site@0.0.0 check-links
> linkcheck http://localhost:3000 --skip-file skipped-urls.txt

Crawling...

Stats:
  272202 links
    1863 destination URLs
    2108 URLs ignored
       0 warnings
       0 errors


@anthony-murphy anthony-murphy self-requested a review March 25, 2026 23:59
@anthony-murphy anthony-murphy merged commit 5f9df7b into microsoft:main Mar 25, 2026
34 checks passed
@anthony-murphy anthony-murphy deleted the staging-batch-control branch March 25, 2026 23:59